21 Cluster ensembles – a knowledge reuse framework for combining multiple partitions
Alexander Strehl, Joydeep Ghosh 

March 2003 The Journal of Machine Learning Research, volume 3 
Publisher: MIT Press 

Full text available: pdf(842.50 KB) Additional Information: full citation, abstract, references, citings, index terms

This paper introduces the problem of combining multiple partitionings of a set of objects 
into a single consolidated clustering without accessing the features or algorithms that 
determined these partitionings. We first identify several application scenarios for the 
resultant 'knowledge reuse' framework that we call cluster ensembles. The cluster 
ensemble problem is then formalized as a combinatorial optimization problem in terms of 
shared mutual information. In addition to a direct ... 

Keywords: cluster analysis, clustering, consensus functions, ensemble, knowledge reuse, 
multi-learner systems, mutual information, partitioning, unsupervised learning 



22 Research track: Mining concept-drifting data streams using ensemble classifiers 
Haixun Wang, Wei Fan, Philip S. Yu, Jiawei Han

August 2003 Proceedings of the ninth ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Publisher: ACM Press 

Full text available: pdf(234.13 KB) Additional Information: full citation, abstract, references, citings, index terms



Recently, mining data streams with concept drifts for actionable insights has become an 
important and challenging task for a wide range of applications including credit card fraud 
protection, target marketing, network intrusion detection, etc. Conventional knowledge 
discovery tools are facing two challenges, the overwhelming volume of the streaming 
data, and the concept drifts. In this paper, we propose a general framework for mining 
concept-drifting data streams using weighted ensemble classifi ... 

Keywords: classifier, classifier ensemble, concept drift, data streams 

Boriana L. Milenova, Joseph S. Yarmus, Marcos M. Campos

August 2005 Proceedings of the 31st international conference on Very large data 
bases VLDB '05

Publisher: VLDB Endowment 

Full text available: pdf(190.75 KB) Additional Information: full citation, abstract, references, index terms

Contemporary commercial databases are placing an increased emphasis on analytic 
capabilities. Data mining technology has become crucial in enabling the analysis of large 
volumes of data. Modern data mining techniques have been shown to have high accuracy 
and good generalization to novel data. However, achieving results of good quality often 
requires high levels of user expertise. Support Vector Machines (SVM) is a powerful state- 
of-the-art data mining algorithm that can address problems not amen ... 



24 Magical thinking in data mining: lessons from CoIL challenge 2000
Charles Elkan 

August 2001 Proceedings of the seventh ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Publisher: ACM Press 

Full text available: pdf(602.56 KB) Additional Information: full citation, abstract, references, citings, index terms

CoIL challenge 2000 was a supervised learning contest that attracted 43 entries. The 
authors of 29 entries later wrote explanations of their work. This paper discusses these 
reports and reaches three main conclusions. First, naive Bayesian classifiers remain 
competitive in practice: they were used by both the winning entry and the next best 
entry. Second, identifying feature interactions correctly is important for maximizing 
predictive accuracy: this was the difference between the winning classi ... 



25 Research sessions: query processing II: Efficient k-NN search on vertically 
decomposed data 

Arjen P. de Vries, Nikos Mamoulis, Niels Nes, Martin Kersten 

June 2002 Proceedings of the 2002 ACM SIGMOD international conference on 
Management of data 

Publisher: ACM Press 

Full text available: pdf(1.26 MB) Additional Information: full citation, abstract, references, index terms

Applications like multimedia retrieval require efficient support for similarity search on 
large data collections. Yet, nearest neighbor search is a difficult problem in high 
dimensional spaces, rendering efficient applications hard to realize: index structures 
degrade rapidly with increasing dimensionality, while sequential search is not an attractive 
solution for repositories with millions of objects. This paper approaches the problem from 
a different angle. A solution is sought in an unconvent ... 



26 Differentiating data- and text-mining terminology 
Jan H. Kroeze, Machdel C. Matthee, Theo J. D. Bothma 

September 2003 Proceedings of the 2003 annual research conference of the South
African institute of computer scientists and information technologists
on Enablement through technology SAICSIT '03

Publisher: South African Institute for Computer Scientists and Information Technologists 

Full text available: pdf(121.29 KB) Additional Information: full citation, abstract, references, index terms

When a new discipline emerges it usually takes some time and lots of academic discussion 
before concepts and terms get standardised. Such a new discipline is text mining. In a 
groundbreaking paper, "Untangling text data mining", Hearst [1999] tackled the
problem of clarifying text-mining concepts and terminology. This essay aims to build on 
Hearst's ideas by pointing out some inconsistencies and suggesting an improved and 
extended categorisation of data- and text-mining tech ... 

Keywords: IR, KDD, TDM, algorithms, database queries, documentation, full-text
retrieval, information retrieval, knowledge creation, knowledge discovery, knowledge 
management, languages, measurement, metadata, text data mining, text mining, text- 
mining, theory 



27 Research track poster: Estimating missed actual positives using independent 
classifiers 

Sandeep Mane, Jaideep Srivastava, San-Yih Hwang 

August 2005 Proceeding of the eleventh ACM SIGKDD international conference on 
Knowledge discovery in data mining KDD '05 

Publisher: ACM Press 

Full text available: pdf(760.77 KB) Additional Information: full citation, abstract, references, index terms

Data mining is increasingly being applied in environments having very high rate of data 
generation like network intrusion detection [7], where routers generate about 300,000 -
500,000 connections every minute. In such rare class data domains, the cost of missing a
rare-class instance is much higher than that of other classes. However, the high cost for 
manual labeling of instances, the high rate at which data is collected as well as real-time 
response constraints do not always allow one to dete ... 

Keywords: capture-recapture method, conditional independence of classifiers given class
label, conditional independence of features given class label, conditional mutual 
information, false negative 



28 Special issue on special feature: A divisive information theoretic feature clustering 
algorithm for text classification 

Inderjit S. Dhillon, Subramanyam Mallela, Rahul Kumar 

March 2003 The Journal of Machine Learning Research, volume 3 

Publisher: MIT Press 

Full text available: pdf(171.07 KB) Additional Information: full citation, abstract, citings, index terms

High dimensionality of text can be a deterrent in applying complex learners such as 
Support Vector Machines to the task of text classification. Feature clustering is a powerful 
alternative to feature selection for reducing the dimensionality of text data. In this paper 
we propose a new information-theoretic divisive algorithm for feature/word clustering and 
apply it to text classification. Existing techniques for such "distributional clustering" of 
words are agglomerative in nature and result in ... 

29 NASA workshop on issues in the application of data mining to scientific data 
Jeanne Behnke, Elaine Dobinson 

June 2000 ACM SIGKDD Explorations Newsletter volume 2 issue 1

Publisher: ACM Press 

Full text available: pdf(1.08 MB) Additional Information: full citation, index terms



Keywords: NASA, data mining, earth science, statistics 

Full text available: pdf(1.03 MB) Additional Information: full citation, abstract, references, index terms

Text classification is a major data mining task. An advanced text classification technique is 
known as partially supervised text classification, which can build a text classifier using a 
small set of positive examples only. This leads to our curiosity whether it is possible to 
find a set of features that can be used to describe the positive examples. Therefore, users 
do not even need to specify a set of positive examples. As the first step, in this paper, we 
formalize it as a new problem, called ... 



31 Subspace clustering for high dimensional data: a review 
Lance Parsons, Ehtesham Haque, Huan Liu 

June 2004 ACM SIGKDD Explorations Newsletter volume 6 issue 1
Publisher: ACM Press 

Full text available: pdf(539.13 KB) Additional Information: full citation, abstract, references

Subspace clustering is an extension of traditional clustering that seeks to find clusters in 
different subspaces within a dataset. Often in high dimensional data, many dimensions 
are irrelevant and can mask existing clusters in noisy data. Feature selection removes 
irrelevant and redundant dimensions by analyzing the entire dataset. Subspace clustering 
algorithms localize the search for relevant dimensions allowing them to find clusters that 
exist in multiple, possibly overlapping subspaces. The ... 

Keywords: clustering survey, high dimensional data, projected clustering, subspace 
clustering 



32 Special issue on special feature: An extensive empirical study of feature selection 
metrics for text classification 

George Forman 

March 2003 The Journal of Machine Learning Research, volume 3 
Publisher: MIT Press 

Full text available: pdf(270.38 KB) Additional Information: full citation, abstract, citings, index terms

Machine learning for text classification is the cornerstone of document categorization, 
news filtering, document routing, and personalization. In text domains, effective feature 
selection is essential to make the learning task efficient and more accurate. This paper 
presents an empirical comparison of twelve feature selection methods (e.g. Information 
Gain) evaluated on a benchmark of 229 text classification problem instances that were 
gathered from Reuters, TREC, OHSUMED, etc. The results are a ... 

33 Content-based retrieval for multimedia databases: A unified framework for image
database clustering and content-based retrieval

Mei-Ling Shyu, Shu-Ching Chen, Min Chen, Chengcui Zhang

November 2004 Proceedings of the 2nd ACM international workshop on Multimedia
databases
Publisher: ACM Press 

Full text available: pdf(291.68 KB) Additional Information: full citation, abstract, references, index terms

With the proliferation of image data, the need to search and retrieve images efficiently 
and accurately from a large image database or a collection of image databases has 
drastically increased. To address such a demand, a unified framework called Markov
Model Mediators (MMMs) is proposed in this paper to facilitate conceptual database
clustering and to improve the query processing performance by analyzing the summarized 
knowledge. The unique characteristics of MMMs are that it ... 

Keywords: Markov model mediators (MMMs), content-based image retrieval (CBIR), 
image database clustering 



34 Systems support for scalable data mining
William A. Maniatty, Mohammed J. Zaki

December 2000 ACM SIGKDD Explorations Newsletter, Volume 2 issue 2
Publisher: ACM Press

Full text available: pdf(1.13 MB) Additional Information: full citation, index terms



Keywords: KDD, data mining, large data sets, parallelism 



35 Text classification: Enhanced word clustering for hierarchical text classification 

Inderjit S. Dhillon, Subramanyam Mallela, Rahul Kumar 
July 2002 Proceedings of the eighth ACM SIGKDD international conference on
Knowledge discovery and data mining 

Publisher: ACM Press 

Full text available: pdf(993.07 KB) Additional Information: full citation, abstract, references, citings, index terms

In this paper we propose a new information-theoretic divisive algorithm for word
clustering applied to text classification. In previous work, such "distributional clustering" 
of features has been found to achieve improvements over feature selection in terms of 
classification accuracy, especially at lower number of features [2, 28]. However the 
existing clustering techniques are agglomerative in nature and result in (i) sub-optimal 
word clusters and (ii) high computational cost. In order to expli ... 

36 Multivariate discretization of continuous variables for set mining
Stephen D. Bay

August 2000 Proceedings of the sixth ACM SIGKDD international conference on
Knowledge discovery and data mining 

Publisher: ACM Press 

Full text available: pdf(184.86 KB) Additional Information: full citation, references, index terms



37 Fast supervised dimensionality reduction algorithm with applications to document 

November 2000 Proceedings of the ninth international conference on Information and
knowledge management 